Abstract: The relationship between the shape of a fitness landscape and the underlying
gene interactions, or epistasis, has been extensively studied in the two-locus
case. Gene interactions among multiple loci are usually reduced to two-way inter- 
actions. We present a geometric theory of shapes of fitness landscapes for multiple 
loci. A central concept is the genotope, which is the convex hull of all possible allele 
frequencies in populations. Triangulations of the genotope correspond to different 
shapes of fitness landscapes and reveal all the gene interactions. The theory is 
applied to fitness data from HIV and Drosophila melanogaster. In both cases, our 
findings refine earlier analyses and reveal previously undetected gene interactions. 
1. Introduction

The term "epistasis" was coined by Bateson (1909) to describe the interactions
among individual genes. The concept was introduced in the setting of
Mendelian genetics, where epistasis gives rise to distorted Mendelian ratios of
genotypes. In the context of statistical genetics, epistasis was originally called 
"epistacy" by Fisher (1918). Here, it arises when mapping discrete genotypes 
to continuous traits and refers to contributions to the phenotype that are not 
linear in the average effects of the single genes. The phenotypic trait of an or- 
ganism that drives the evolution of the population is the reproductive fitness of 
individuals, i.e., the expected number of offspring. For this trait, the genotype- 
phenotype mapping is called a fitness landscape (Gavrilets, 2004; Wright, 1931), 
and epistasis is a property of that landscape. 

For a genetic system of two biallelic loci, there is only one type of interaction,
illustrated by the landscapes in Figure 1.1, and epistasis refers unambiguously
to this interaction. However, the situation is more complex for more than two
loci, because new interaction patterns arise. The current language for describing
gene interactions does not reflect this diversity (Phillips, 1998). Indeed, the
common approach of investigating epistasis by analysis of variance (ANOVA)
and expressing epistatic effects as the residuals of a linear regression of fitness
on the genotypes (Cordell, 2002) does not account for the intrinsic combinatorial
structure of the landscapes that generalize two-locus epistasis.

Figure 1.1: The possible shapes of fitness landscapes on two biallelic loci. The genotope
is the square. Its two triangulations correspond to negative and positive epistasis.

For a specified set of genotypes that may involve multiple loci, we charac- 
terize all possible interactions among them. Our characterization is based on 
a geometric object, the genotope, which is the convex hull of the possible allele 
frequency vectors. The regular triangulations (De Loera et al., 2006) of the geno- 
tope encode the genotype interactions in the fitness landscape. The biological 
problem of studying genotype interactions for a fitness landscape is thus equiv- 
alent to the combinatorial problem of finding the shape of the fitness landscape, 
i.e., the triangulation of the genotope that is induced by the fitness landscape. 

In Sections 2 to 4, the mathematical concepts are introduced and illustrated 
in several examples. For simplicity, we focus primarily on the case of a population 
of haploid individuals (or, equivalently, of homozygous diploids). Nevertheless, 
our concepts and algorithms are not limited to haploids, and in Example 2.5 we
briefly indicate the modifications necessary for diploids. 

In Section 5, we discuss the three-locus two-allele system, which amounts to
classifying the 74 triangulations of the 3-cube. Table 5.1 is the direct generalization
of the list of possible shapes for two biallelic loci, shown in Figure 1.1.

In Sections 6 and 7, we apply our method to two published fitness landscapes.
Section 6 deals with a biallelic three-locus system in HIV (Segal et al.,
2004) and emphasizes the modeling of measurement error. Section 7 revisits the
five-locus system in Drosophila melanogaster studied by Whitlock and Bourguet
(2000). The shape of their Drosophila fitness landscape is a triangulation of the
5-dimensional cube (the genotope) into 110 simplices, listed in Figure 7.7.

Our discussion in Section 8 compares our approach with other studies of 
fitness landscapes. It also raises the challenge of determining the human genotope 
and its biologically relevant triangulations from suitable haplotype data. 

2. Populations and the Genotope 

We fix a finite alphabet S of size l, which labels the l different alleles at
a genetic locus of interest. The elements of S may correspond to variants of
a gene, to the nucleotides at a genome position (S = {A,C,G,T}, l = 4), or
to the amino acids at a codon position in a gene (l = 20). The biallelic case
(S = {0,1}, l = 2) arises frequently in genomics, where it is known that most
SNPs (single nucleotide polymorphisms) have two types.

The allele frequencies at a certain locus define a probability distribution on
the alphabet S. The set of all probability distributions on S is identified with
the (l − 1)-dimensional standard simplex

$$\Delta_S = \{(p_1, p_2, \ldots, p_l) \in [0,1]^l : p_1 + p_2 + \cdots + p_l = 1\}.$$

We consider n loci, all with the same alphabet S of alleles, and we denote by
$S^n$ the set of sequences of length n over S. The elements of $S^n$ are identified with
the vertices of the n-fold direct product of simplices $\Delta_S^n = (\Delta_S)^n$. This product
is a convex polytope (Ziegler, 1995) of dimension $ln - n$ having $l^n$ vertices. In
particular, in the binary case, $\Delta_{\{0,1\}}^n$ is the standard n-dimensional cube.

A genotype space is any subset of $S^n$. For a given genotype space $\mathcal{G}$, let $\Pi_{\mathcal{G}}$
be the convex hull of all vertices in $\Delta_S^n$ that are indexed by sequences in $\mathcal{G}$. The set
$\Pi_{\mathcal{G}}$ is a subpolytope of $\Delta_S^n$. We call $\Pi_{\mathcal{G}}$ the genotope. A point v in the genotope
is an n-tuple of allele frequencies. To be precise, if $v = (v_1, \ldots, v_n) \in \Pi_{\mathcal{G}} \subseteq \Delta_S^n$
then $v_i \in \Delta_S$ represents the allele frequencies at locus i.

Example 2.1. (Genotype lattices) 

In the study of directed evolution in (Beerenwinkel et al., 2006) the alphabet is
S = {0, 1} and the genotype space $\mathcal{G}$ is the distributive lattice induced by an
event poset $\mathcal{E}$ with n elements. The genotope $\Pi_{\mathcal{G}}$ is the order polytope of the
poset $\mathcal{E}$. If the event poset $\mathcal{E}$ is empty then $\mathcal{G} = \{0, 1\}^n$ and the genotope $\Pi_{\mathcal{G}}$ is the
n-dimensional cube. The case n = 2 is Example 2.4. The case n = 3 is discussed
in detail in Section 4. For an example of a non-empty event poset consider n = 3
and $\mathcal{E} = \{2 < 3\}$, meaning that the second event has to occur before the third
event can happen. The induced genotype lattice has six genotypes,

$$\mathcal{G} = \{000, 001, 011, 100, 101, 111\}, \tag{2.1}$$

and the corresponding genotope $\Pi_{\mathcal{G}}$ is a triangular prism. □

Figure 2.2: A three-dimensional genotope from the HIV data of Section 6.

Example 2.2. (A genotope from HIV fitness data) 

Consider the genotype space

$$\mathcal{G} = \{000, 110, 011, 100, 101, 111\}, \tag{2.2}$$

which differs from the one in (2.1) only by the second genotype. This $\mathcal{G}$ does not
form a distributive lattice, so it does not arise in the setting of (Beerenwinkel et
al., 2006). The genotope $\Pi_{\mathcal{G}}$ has six triangular faces and one square face (Figure
2.2). This genotope appears in our analysis of the HIV data in Section 6. □

By a population on $\mathcal{G}$ we shall mean any probability distribution p on the set
$\mathcal{G}$. For any genotype $g \in \mathcal{G}$, the coordinate $p_g$ of p represents the fraction of the
population that is of genotype g. A population is a point in the population simplex
$\Delta_{\mathcal{G}}$. The population simplex and the genotope are related via the marginalization
map $\rho : \Delta_{\mathcal{G}} \to \Pi_{\mathcal{G}}$, which maps a population p to its n-tuple of allele frequencies.




If the population p consists of a single genotype g, then p is the unit vector whose
coordinates are $p_g = 1$ and $p_h = 0$ for all $h \in \mathcal{G} \setminus \{g\}$. Its allele frequency vector
$\rho(g)$ is a list of n unit vectors, each of length l. The unit vector $\rho(g)_i$ is the
vertex of the simplex $\Delta_S$ indexed by the i-th allele of the genotype g. Thus $\rho(g)$
is a vertex of the genotope $\Pi_{\mathcal{G}}$, and all vertices arise in this manner. Since the
marginalization map $\rho$ is linear, we conclude:

Proposition 2.3. The genotope $\Pi_{\mathcal{G}}$ equals the set of all possible n-tuples of allele
frequencies that may arise from a population on the genotype space $\mathcal{G}$.

Example 2.4. (Tetrahedron maps onto square) 

Let n = 2, S = {0, 1}, and consider the genotype space $\mathcal{G} = \{00, 01, 10, 11\}$. The
set $\Delta_{\mathcal{G}}$ of probability distributions on $\mathcal{G}$ is a tetrahedron in 4-space,

$$\Delta_{\mathcal{G}} = \{(p_{00}, p_{01}, p_{10}, p_{11}) \in [0, 1]^4 : p_{00} + p_{01} + p_{10} + p_{11} = 1\}.$$

The genotope $\Pi_{\mathcal{G}}$ is the square $\Delta_S \times \Delta_S$. A population p is a point in the
tetrahedron whose allele frequencies are given by the marginalization map

$$\rho(p_{00}, p_{01}, p_{10}, p_{11}) = \big((p_{00} + p_{01},\; p_{10} + p_{11}),\; (p_{00} + p_{10},\; p_{01} + p_{11})\big).$$

If we identify the square $\Pi_{\mathcal{G}} = \Delta_S^2$ with the convex hull of the four points (0, 0),
(0, 1), (1, 0), (1, 1) in the plane, then the marginalization map is the following
projection (see Figure 2.3) from the tetrahedron onto the square:

$$\rho(p_{00}, p_{01}, p_{10}, p_{11}) = (p_{10} + p_{11},\; p_{01} + p_{11}).$$

The two coordinates are the frequencies at the two loci of the allele "1". □
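
The projection in Example 2.4 can be tried out numerically. The following is a minimal sketch, assuming binary genotypes written as 0/1 strings; the function name `marginalize` is hypothetical and not part of the original text. It also shows that two different populations can share the same allele frequencies.

```python
from fractions import Fraction

def marginalize(population):
    """Map a population {genotype: frequency} on binary genotypes (0/1 strings)
    to the tuple of frequencies of allele 1 at each locus."""
    if sum(population.values()) != 1:
        raise ValueError("genotype frequencies must sum to 1")
    n = len(next(iter(population)))
    return tuple(
        sum(freq for g, freq in population.items() if g[i] == "1")
        for i in range(n)
    )

half = Fraction(1, 2)
p = {"00": half, "11": half}   # only the genotypes 00 and 11 are present
q = {"01": half, "10": half}   # only the genotypes 01 and 10 are present
print(marginalize(p), marginalize(q))
```

Both populations are sent to the centre of the square, so they lie over the same point of the genotope.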
Figure 2.3: Population simplex and genotope for one diploid biallelic locus.



Example 2.5. (From haploids to diploids) 

The mathematical set-up introduced so far is for haploids only. It can be extended
to diploids using either of the following two equivalent approaches. The allele
frequencies of diploid populations can be modeled by scaling each simplex in
$\Delta_S^n$ by a factor of two, so that the genotope sits inside $(2\Delta_S)^n$. Alternatively, we
could replace n by 2n and then restrict to symmetric populations in $\Delta_S^{2n}$.

To illustrate both approaches, we consider a single biallelic locus (l = 2,
n = 1) with genotypes aa, AA, aA, Aa, where we identify Aa with aA. The population
simplex is a triangle, and the genotope $2\Delta_S$ is a line segment whose end
points are labeled aa and AA and whose midpoint is labeled by Aa = aA. This
picture arises from $\Delta_{S^2} \to \Delta_S^2$ by intersecting with a plane. The genotope is the
diagonal segment in the square of Example 2.4 and the population triangle is a
cross-section of the tetrahedron, depicted in Figure 2.3. A point in the population
triangle has three coordinates $(p_{aa}, p_{aA}, p_{AA})$ where $p_{aa}$ is the frequency of
genotype aa, $p_{AA}$ that of genotype AA, and $2p_{aA}$ that of genotype Aa = aA. □

Returning to the general haploid model, we give an interpretation of the
fiber of a point in the genotope $\Pi_{\mathcal{G}}$ under the marginalization map $\rho$. The
fiber over $v \in \Pi_{\mathcal{G}}$ is the following polytope inside the population simplex:

$$\rho^{-1}(v) = \{p \in \Delta_{\mathcal{G}} : \rho(p) = v\}. \tag{2.3}$$

Remark 2.6. If v is any n-tuple of allele frequencies then its fiber $\rho^{-1}(v)$ consists
of all populations p which have the specified allele frequencies v.

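If all coordinates of v are non-zero, then $\rho^{-1}(v)$ is a polytope of dimension

$$c(\mathcal{G}) = \dim(\Delta_{\mathcal{G}}) - \dim(\Pi_{\mathcal{G}}) = |\mathcal{G}| - \dim(\Pi_{\mathcal{G}}) - 1. \tag{2.4}$$

For the full genotype space $\mathcal{G} = S^n$ we have $c(\mathcal{G}) = l^n - (l-1)n - 1$, which is
$2^n - n - 1$ in the binary case. In particular, in
Example 2.4 the dimension is one, and in Example 2.2 the dimension is two.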
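
Formula (2.4) is easy to evaluate for small binary genotype spaces. The sketch below assumes genotypes given as 0/1 strings, so that the genotope sits inside the n-cube; the helper names `rank` and `interaction_dimension` are illustrative and not taken from the original text. It uses the fact that the rank of the homogenized vertex matrix equals the dimension of the genotope plus one.

```python
from fractions import Fraction
from itertools import product

def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def interaction_dimension(genotypes):
    """c(G) = |G| - dim(genotope) - 1 for binary genotypes given as 0/1 strings."""
    # The rows (1, g_1, ..., g_n) have rank dim(genotope) + 1.
    rows = [[1] + [int(a) for a in g] for g in genotypes]
    return len(genotypes) - rank(rows)

square = ["00", "01", "10", "11"]                    # Example 2.4
hiv = ["000", "110", "011", "100", "101", "111"]     # Example 2.2
cube = ["".join(bits) for bits in product("01", repeat=3)]
print(interaction_dimension(square), interaction_dimension(hiv), interaction_dimension(cube))
```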

Since the fibers (2.3) over the genotope $\Pi_{\mathcal{G}}$ characterize populations with
constant allele frequency vectors, we can restrict our attention to these fibers
whenever the allele frequencies are fixed. For example, if evolution acts on the
genotype space by recombination, but without mutation and selection, then the
allele frequencies of the population are constant for every generation. Hence,
such evolutionary dynamics can be modeled by a dynamical system on $\rho^{-1}(v)$.

Our geometric theory is adapted to the fact that, in biological systems, the
number $|\mathcal{G}|$ of observed genotypes is usually significantly smaller than the number $l^n$
of possible genotypes. Specifically, for binary data (l = 2) on many loci (say,
n ≥ 20), the genotope is never an n-cube, and its dimension is smaller than n
due to phenomena such as linkage. The frequently heard assertion that dimensions
of genotype spaces and fitness landscapes increase exponentially in the sequence
length is thus misleading. Even for large data sets, the complexity of the genotope
$\Pi_{\mathcal{G}}$ can be expected to be in the tractable range of polyhedral algorithms.

3. Fitness Landscapes and Interaction Coordinates 

A fitness landscape on a genotype space $\mathcal{G}$ is a function $w : \mathcal{G} \to \mathbb{R}$. Each
coordinate $w_g$ of w denotes the logarithm of the reproductive fitness of genotype g.
The space of all fitness landscapes is the $|\mathcal{G}|$-dimensional $\mathbb{R}$-vector space $\mathbb{R}^{\mathcal{G}}$.

In the study of gene interactions one considers linear combinations of the
measured coordinates $w_g$. These linear combinations express epistatic effects.
Certain collections of such linear forms play the role of interaction coordinates
on $\mathbb{R}^{\mathcal{G}}$. Both the sign and the magnitude of these interaction coordinates are of
interest when examining the fitness landscape w of a biological system.

Example 3.7. (Two triangles over a square) 

As in Example 2.4 let $\mathcal{G} = \{0,1\}^2 = \{00, 01, 10, 11\}$, so the genotope $\Pi_{\mathcal{G}}$ is a
square. A fitness landscape w is specified by the four numbers $w_{00}$, $w_{01}$, $w_{10}$,
and $w_{11}$, which we visualize as heights over the vertices of the square $\Pi_{\mathcal{G}}$. The
interaction between the two loci is measured by the epistasis

$$u = w_{00} + w_{11} - w_{01} - w_{10}.$$

This defines three equivalence classes in the space of fitness landscapes according
to whether the epistasis u is positive, negative, or zero. This trichotomy is
depicted in Figure 1.1. This geometric view of genotype interaction leads us (in
Section 4) to a natural concept of shapes of fitness landscapes for n > 2 loci. □
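
The trichotomy of Example 3.7 can be written out as a small function. This is only an illustrative sketch: the function name `two_locus_shape` is hypothetical, and the triangles returned for each sign follow the rule that the two genotypes on the lower diagonal of the square cannot coexist in a fittest population.

```python
def two_locus_shape(w):
    """Classify a two-locus landscape {"00": .., "01": .., "10": .., "11": ..}
    by the sign of its epistasis u = w00 + w11 - w01 - w10."""
    u = w["00"] + w["11"] - w["01"] - w["10"]
    if u > 0:
        # 01 and 10 never coexist: the square is cut along the diagonal 00-11.
        return u, [("00", "01", "11"), ("00", "10", "11")]
    if u < 0:
        # 00 and 11 never coexist: the square is cut along the diagonal 01-10.
        return u, [("00", "01", "10"), ("01", "10", "11")]
    # No interaction: the landscape is flat over the whole square.
    return u, [("00", "01", "10", "11")]

print(two_locus_shape({"00": 0, "01": 1, "10": 1, "11": 3}))
print(two_locus_shape({"00": 0, "01": 1, "10": 2, "11": 3}))
```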

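Let $\mathcal{L}_{\mathcal{G}}$ be the subspace of $\mathbb{R}^{\mathcal{G}}$ consisting of all fitness landscapes w which have
no interaction. Mathematically, w is in $\mathcal{L}_{\mathcal{G}}$ if there is an affine-linear function
on the genotope $\Pi_{\mathcal{G}}$ whose values at the vertices are the $w_g$. We define the
interaction space $\mathcal{I}_{\mathcal{G}}$ as the vector space dual to the quotient of $\mathbb{R}^{\mathcal{G}}$ modulo $\mathcal{L}_{\mathcal{G}}$:

$$\mathcal{I}_{\mathcal{G}} := (\mathbb{R}^{\mathcal{G}} / \mathcal{L}_{\mathcal{G}})^*.$$

An element of the interaction space $\mathcal{I}_{\mathcal{G}}$ is a linear form in the unknowns $w_g$ which
vanishes identically on the subspace $\mathcal{L}_{\mathcal{G}}$. In Example 2.4 the space $\mathbb{R}^{\mathcal{G}}$ is four-dimensional,
its subspace $\mathcal{L}_{\mathcal{G}}$ is three-dimensional, and $\mathcal{I}_{\mathcal{G}}$ is the one-dimensional
space spanned by $u = w_{00} + w_{11} - w_{01} - w_{10}$. In general, the dimension of the
interaction space $\mathcal{I}_{\mathcal{G}}$ is the quantity $c(\mathcal{G})$ defined in Equation (2.4).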

The interaction space $\mathcal{I}_{\mathcal{G}}$ is spanned (redundantly) by a canonical set of linear
forms which are known as the circuits. These are the linear forms whose support
(i.e., the $w_g$ which appear with non-zero coefficient) is non-empty but minimal
with respect to inclusion. The circuits in $\mathcal{I}_{\mathcal{G}}$ are unique up to scaling. Their
number is usually larger than $c(\mathcal{G})$ but it is bounded above by $\binom{|\mathcal{G}|}{c(\mathcal{G}) - 1}$.

To see this upper bound, we note that $\mathcal{L}_{\mathcal{G}}$ has dimension $|\mathcal{G}| - c(\mathcal{G})$. The
circuits of $\mathcal{I}_{\mathcal{G}}$ are computed by considering any set of $|\mathcal{G}| - c(\mathcal{G}) + 1$ of the
unknowns $w_g$. There exists a linear combination of these $w_g$ which vanishes on
$\mathcal{L}_{\mathcal{G}}$. If this linear combination is unique (up to scaling) then it is a circuit. The
converse holds as well: all circuits in $\mathcal{I}_{\mathcal{G}}$ are found in this manner.
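
The enumeration just described can be sketched directly for binary genotype spaces. The code below is an illustration under stated assumptions, not the authors' software: genotypes are 0/1 strings, the helper names `null_space`, `primitive` and `circuits` are hypothetical, and each circuit is reported as a tuple of coprime integer coefficients (with positive first non-zero entry) indexed in the order of the input list.

```python
from fractions import Fraction
from itertools import combinations, product
from math import gcd

def null_space(matrix):
    """Basis of the right null space of `matrix` (a list of rows) over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    ncols = len(m[0])
    pivots, r = [], 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        lead = m[r][col]
        m[r] = [a / lead for a in m[r]]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(col)
        r += 1
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        vec = [Fraction(0)] * ncols
        vec[free] = Fraction(1)
        for row, pc in zip(m, pivots):
            vec[pc] = -row[free]
        basis.append(vec)
    return basis

def primitive(vec):
    """Scale a rational vector to coprime integers with positive first non-zero entry."""
    scale = 1
    for x in vec:
        scale = scale * x.denominator // gcd(scale, x.denominator)
    ints = [int(x * scale) for x in vec]
    g = 0
    for x in ints:
        g = gcd(g, x)
    ints = [x // g for x in ints]
    if next(x for x in ints if x != 0) < 0:
        ints = [-x for x in ints]
    return tuple(ints)

def circuits(genotypes):
    """All circuits of a binary genotype space, indexed like `genotypes`."""
    # Column j is the homogenized vertex (1, g_1, ..., g_n) of genotype j.
    cols = [[1] + [int(a) for a in g] for g in genotypes]
    def rows(subset):
        return [[cols[j][k] for j in subset] for k in range(len(cols[0]))]
    everything = range(len(genotypes))
    rank = len(genotypes) - len(null_space(rows(everything)))  # |G| - c(G)
    found = set()
    for subset in combinations(everything, rank + 1):
        basis = null_space(rows(subset))
        if len(basis) == 1:  # dependency unique up to scaling: a circuit
            coeffs = [Fraction(0)] * len(genotypes)
            for j, c in zip(subset, basis[0]):
                coeffs[j] = c
            found.add(primitive(coeffs))
    return found

hiv = ["000", "011", "100", "101", "110", "111"]
cube = ["".join(bits) for bits in product("01", repeat=3)]
print(len(circuits(hiv)), len(circuits(cube)))
```

Storing circuits in a set after normalization removes the duplicates that arise when several subsets of unknowns produce the same linear form.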

Example 3.8. Let $\mathcal{G}$ be the genotype space in Example 2.2 and Figure 2.2. We
have $|\mathcal{G}| = 6$ and $c(\mathcal{G}) = 2$, so our bounds say that the number of circuits is
between two and six. In fact, this example has precisely four circuits:

$$\begin{aligned}
f &= w_{100} - w_{101} - w_{110} + w_{111}, \\
g &= w_{000} - w_{011} - w_{100} + w_{111}, \\
n &= w_{011} + w_{101} + w_{110} - w_{000} - 2w_{111}, \\
s &= w_{000} + w_{101} + w_{110} - w_{011} - 2w_{100}.
\end{aligned}$$

The names f, g, n, and s were chosen to be consistent with the discussion of the
ambient 3-cube in Example 3.9. The signs of these four circuits characterize the
possible interactions in any fitness landscape on the genotype space $\mathcal{G}$. Only
eight of the 16 possible sign patterns can occur, since the linear forms f, g, n, and
s lie in the two-dimensional space $\mathcal{I}_{\mathcal{G}}$. See also Figure 6.6. □

For certain genotype spaces $\mathcal{G}$, the interaction space $\mathcal{I}_{\mathcal{G}}$ will have a distinguished
basis consisting of interaction coordinates. The choice of interaction
coordinates depends on the genotope $\Pi_{\mathcal{G}}$ and on the situation of interest. A
natural choice exists in the case of the n-cube, where l = 2 and $\mathcal{G} = \{0,1\}^n$.
Consider any binary string $i_1 i_2 \cdots i_n$ which has at least two entries that are 1.
For such a string $i_1 i_2 \cdots i_n$ we introduce the following element of $\mathcal{I}_{\mathcal{G}}$:

$$u_{i_1 i_2 \cdots i_n} := \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} \cdots \sum_{j_n=0}^{1} (-1)^{i_1 j_1 + i_2 j_2 + \cdots + i_n j_n} \, w_{j_1 j_2 \cdots j_n}.$$

The number of these linear forms equals $c(\mathcal{G}) = 2^n - n - 1$, and they form a basis
of $\mathcal{I}_{\mathcal{G}}$. We call the $u_{i_1 i_2 \cdots i_n}$ the interaction coordinates for the n-cube.

The linear transformation above is the Fourier transform for the group
$(\mathbb{Z}_2)^n$. It has appeared frequently in the mathematical biology literature. For
instance, Feldman et al. (1974) and Karlin and Feldman (1970) used it to study
equilibria of dynamical evolutionary systems on two and three binary loci. It
appears in linkage analysis (Hallgrímsdóttir, 2005), and in phylogenetics, where
it gives rise to Hadamard conjugation (Hendy and Charleston, 1993).
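
As a small illustration of this transform, the sketch below evaluates the signed sums over $(\mathbb{Z}_2)^n$ for every index string with at least two entries equal to 1. It is written without any normalizing constant, which is an assumption of this sketch; a positive rescaling would not change the signs. The function name `interaction_coordinates` is hypothetical.

```python
def interaction_coordinates(w):
    """Signed sums u_i = sum_j (-1)^(i.j) w_j over all binary strings j,
    for every index string i with at least two entries equal to 1.
    `w` maps 0/1 strings of equal length n to fitness values."""
    n = len(next(iter(w)))
    coords = {}
    for index in range(2 ** n):
        i = format(index, "0{}b".format(n))
        if i.count("1") < 2:
            continue
        total = 0
        for j, value in w.items():
            exponent = sum(int(a) * int(b) for a, b in zip(i, j))
            total += (-1) ** exponent * value
        coords[i] = total
    return coords

# An additive landscape (no interaction) has all interaction coordinates zero.
additive = {
    "{}{}{}".format(a, b, c): 7 + 1 * a + 2 * b + 5 * c
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
}
print(interaction_coordinates(additive))
print(interaction_coordinates({"00": 1, "01": 2, "10": 4, "11": 8}))
```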
Example 3.9. (The 3-cube) We discuss the interaction space for the three-dimensional
cube (n = 3, l = 2). The four interaction coordinates are

$$\begin{aligned}
u_{110} &= (w_{000} + w_{001}) + (w_{110} + w_{111}) - (w_{010} + w_{011}) - (w_{100} + w_{101}), \\
u_{101} &= (w_{000} + w_{010}) + (w_{101} + w_{111}) - (w_{001} + w_{011}) - (w_{100} + w_{110}), \\
u_{011} &= (w_{000} + w_{100}) + (w_{011} + w_{111}) - (w_{001} + w_{101}) - (w_{010} + w_{110}), \\
u_{111} &= (w_{000} + w_{011} + w_{101} + w_{110}) - (w_{100} + w_{010} + w_{001} + w_{111}).
\end{aligned}$$

These four linear forms form a natural basis for the interaction space $\mathcal{I}_{\{0,1\}^3}$.
The interaction coordinate $u_{110}$ measures the marginal epistasis between locus 1
and locus 2, and similarly for $u_{101}$ and $u_{011}$. The last interaction coordinate $u_{111}$
measures the three-way interaction among the loci.

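The 3-cube has twenty circuits in three symmetry classes. We abbreviate
the circuits of the 3-cube by the first twenty letters of the alphabet. Circuits
are only defined up to scaling: with the interaction coordinates as displayed above,
each expression in the $w_g$ below is one half of the stated combination of the u's,
which does not affect signs. The first
six circuits correspond to the six faces of the 3-cube, and they measure the
conditional epistasis between two loci when the allele at the third locus is fixed: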



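$$\begin{aligned}
a &:= u_{110} + u_{111} = w_{000} - w_{010} - w_{100} + w_{110}, \\
b &:= u_{110} - u_{111} = w_{001} - w_{011} - w_{101} + w_{111}, \\
c &:= u_{101} + u_{111} = w_{000} - w_{001} - w_{100} + w_{101}, \\
d &:= u_{101} - u_{111} = w_{010} - w_{011} - w_{110} + w_{111}, \\
e &:= u_{011} + u_{111} = w_{000} - w_{001} - w_{010} + w_{011}, \\
f &:= u_{011} - u_{111} = w_{100} - w_{101} - w_{110} + w_{111}.
\end{aligned}$$
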
The second class of circuits relates marginal epistases of two pairs of loci: 



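$$\begin{aligned}
g &:= u_{110} + u_{101} = w_{000} - w_{011} - w_{100} + w_{111}, \\
h &:= u_{110} - u_{101} = w_{001} - w_{010} - w_{101} + w_{110}, \\
i &:= u_{110} + u_{011} = w_{000} - w_{010} - w_{101} + w_{111}, \\
j &:= u_{110} - u_{011} = w_{001} - w_{011} - w_{100} + w_{110}, \\
k &:= u_{101} + u_{011} = w_{000} - w_{001} - w_{110} + w_{111}, \\
l &:= u_{101} - u_{011} = w_{010} - w_{011} - w_{100} + w_{101}.
\end{aligned}$$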



Geometrically, the six circuits g, h, i, j, k, and l correspond to squares formed
by vertices of the 3-cube that slice the 3-cube into two triangular prisms.

The last class consists of eight circuits which relate the three-way interaction
to the total two-way epistasis, where signs are taken into consideration:

$$\begin{aligned}
m &:= -u_{011} - u_{101} - u_{110} - u_{111} = w_{001} + w_{010} + w_{100} - w_{111} - 2w_{000}, \\
n &:= -u_{011} - u_{101} - u_{110} + u_{111} = w_{011} + w_{101} + w_{110} - w_{000} - 2w_{111}, \\
o &:= \phantom{-}u_{011} + u_{101} - u_{110} - u_{111} = w_{010} + w_{100} + w_{111} - w_{001} - 2w_{110}, \\
p &:= \phantom{-}u_{011} + u_{101} - u_{110} + u_{111} = w_{000} + w_{011} + w_{101} - w_{110} - 2w_{001}, \\
q &:= \phantom{-}u_{011} - u_{101} + u_{110} - u_{111} = w_{001} + w_{100} + w_{111} - w_{010} - 2w_{101}, \\
r &:= \phantom{-}u_{011} - u_{101} + u_{110} + u_{111} = w_{000} + w_{011} + w_{110} - w_{101} - 2w_{010}, \\
s &:= -u_{011} + u_{101} + u_{110} + u_{111} = w_{000} + w_{101} + w_{110} - w_{011} - 2w_{100}, \\
t &:= -u_{011} + u_{101} + u_{110} - u_{111} = w_{001} + w_{010} + w_{111} - w_{100} - 2w_{011}.
\end{aligned}$$

Geometrically, these correspond to the eight bipyramids in the 3-cube. □ 

The sign of the interaction coordinate u in the two-locus case of Example 3.7
determines the nature of the epistatic interaction. In the three-locus case, one
may wish to record the signs of each of the twenty circuits a, b, ..., t. For
instance, the signs of a, b, c, d, e, and f specify all the conditional epistases
of the fitness landscape w. The signs of the bipyramidal circuits m, n, ..., t
describe a three-way interaction which does not have a two-locus interpretation.
The complete list of the twenty signs characterizes all the interactions among the
genotypes, and also determines (but is not equivalent to) the shape of the fitness
landscape. This is made precise in Proposition 4.13.

While the detailed study of the n-cube for small values of n provides a 
useful tool for data analysis, we wish to recall that biological genotopes may not 
be cubes. In such cases, there may not be any choice of interaction coordinates 
which is as nice and canonical as that coming from the Fourier transform. Fixing 
a basis for the interaction space will be a matter of choice and preference. What 
remains canonical and natural is the full collection of all circuits of the genotope. 

We conclude our discussion of gene interactions by comparing the proposed 
circuits to the more traditional approach of using ANOVA (Lindman, 1974). We 
illustrate the key difference between the two methods in the following example. 

Example 3.10. (Two DNA loci) Consider the genotype space $\mathcal{G} = \{A, C, G, T\}^2$
of two DNA loci. A fitness landscape on $\mathcal{G}$ is a matrix

$$w = \begin{pmatrix}
w_{AA} & w_{AC} & w_{AG} & w_{AT} \\
w_{CA} & w_{CC} & w_{CG} & w_{CT} \\
w_{GA} & w_{GC} & w_{GG} & w_{GT} \\
w_{TA} & w_{TC} & w_{TG} & w_{TT}
\end{pmatrix}.$$

Let $w_{i\bullet}$ and $w_{\bullet j}$ denote the row and column means, respectively, of w, and
denote by $w_{\bullet\bullet}$ the grand mean. In the 2-way ANOVA analysis of the table w
one considers the 16 linear forms

$$w_{ij} - w_{i\bullet} - w_{\bullet j} + w_{\bullet\bullet} \qquad (i, j \in \{A, C, G, T\}),$$

which measure the direction and amount by which each fitness value differs from
our expectation based only on the row and column means. By contrast, there
are 204 circuits in the interaction space $\mathcal{I}_{\mathcal{G}}$ including, for example,

$$\begin{aligned}
& w_{AG} - w_{AT} - w_{TG} + w_{TT}, \\
& -w_{AC} + w_{AG} + w_{CA} - w_{CT} - w_{GG} + w_{GT} - w_{TA} + w_{TC}, \\
& -w_{AA} + w_{AG} - w_{GG} + w_{GT} + w_{TA} - w_{TT}.
\end{aligned}$$

Thus, the circuits measure deviation from linearity in a different, much finer way
than ANOVA does. □

4. The Shapes of Fitness Landscapes

We have defined fitness landscapes as discrete objects that assign one fitness
value to each individual genotype $g \in \mathcal{G}$. However, in order to speak about
"shape" or "curvature" of $w : \mathcal{G} \to \mathbb{R}$, one needs a continuous object. This
dilemma is resolved by considering populations $p \in \Delta_{\mathcal{G}}$ rather than individuals.

The fitness of a population p is defined as the average fitness of all individuals
in the population. Since the individuals are grouped into classes of identical
genotypes, the population fitness can be written as the dot product

$$w \cdot p = \sum_{g \in \mathcal{G}} w_g \, p_g.$$

This notion of population fitness leads to an extension of a discrete landscape
$w : \mathcal{G} \to \mathbb{R}$ to a function $\tilde{w}$ on the full genotope $\Pi_{\mathcal{G}}$. The continuous landscape
$\tilde{w} : \Pi_{\mathcal{G}} \to \mathbb{R}$ derived from w assigns to each point v in the genotope the maximum
fitness among all populations p with these allele frequencies. We define

$$\tilde{w}(v) := \max\{\, p \cdot w : p \in \rho^{-1}(v) \,\} \quad \text{for all } v \in \Pi_{\mathcal{G}}. \tag{4.1}$$

Computing the fitness value $\tilde{w}(v)$ amounts to solving a $c(\mathcal{G})$-dimensional linear
programming problem, namely, to maximizing the linear functional $p \mapsto p \cdot w$
over the fiber $\rho^{-1}(v)$. The continuous landscape $\tilde{w}$ is the smallest concave function
which has the same values as w on the vertices of $\Pi_{\mathcal{G}}$. It is continuous and
piecewise linear. Our classification of landscapes rests on the following remark.
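
For very small genotype spaces the linear program in (4.1) can be solved by brute force, since the maximum is attained at a vertex of the fiber, and the support of a vertex consists of genotypes whose homogenized coordinate vectors are linearly independent. The sketch below is an illustration under these assumptions, not a substitute for a proper LP solver: it handles binary genotypes given as 0/1 strings, and the names `solve_unique` and `continuous_landscape` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def solve_unique(columns, rhs):
    """Solve sum_j x_j * columns[j] = rhs exactly; return x if a solution
    exists and is unique, otherwise None."""
    k = len(columns)
    rows = [[Fraction(col[i]) for col in columns] + [Fraction(rhs[i])]
            for i in range(len(rhs))]
    r = 0
    for c in range(k):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            return None  # dependent columns: the solution is not unique
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [a / lead for a in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    if any(row[-1] != 0 for row in rows[r:]):
        return None  # inconsistent system
    return [rows[j][-1] for j in range(k)]

def continuous_landscape(w, v):
    """Return (value, fittest population) at the allele-frequency vector v,
    where v[i] is the frequency of allele 1 at locus i; None if v lies
    outside the genotope."""
    genotypes = sorted(w)
    cols = {g: [1] + [int(a) for a in g] for g in genotypes}
    rhs = [1] + list(v)
    best = None
    for size in range(1, len(rhs) + 1):
        for subset in combinations(genotypes, size):
            x = solve_unique([cols[g] for g in subset], rhs)
            if x is None or any(c < 0 for c in x):
                continue
            value = sum(c * w[g] for c, g in zip(x, subset))
            if best is None or value > best[0]:
                best = (value, {g: c for g, c in zip(subset, x) if c > 0})
    return best

half = Fraction(1, 2)
positive = {"00": 0, "01": 0, "10": 0, "11": 1}   # epistasis u = +1
negative = {"00": 0, "01": 1, "10": 1, "11": 1}   # epistasis u = -1
print(continuous_landscape(positive, (half, half)))
print(continuous_landscape(negative, (half, half)))
```

In the first landscape the fittest population at the centre of the square mixes 00 and 11, in the second it mixes 01 and 10.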

Remark 4.11. The domains of linearity of the piecewise linear function $\tilde{w}$ are
the cells in a regular polyhedral subdivision $\Pi_{\mathcal{G}}[w]$ of the genotope $\Pi_{\mathcal{G}}$.

We refer to the book of De Loera et al. (2006) for an introduction to the
geometry of polyhedral subdivisions. Remark 4.11 appears in (De Loera et al.,
2006, Chap. 2). We call the induced polyhedral subdivision $\Pi_{\mathcal{G}}[w]$ the shape of the
fitness landscape w. For almost all fitness landscapes $w \in \mathbb{R}^{\mathcal{G}}$, the subdivision
$\Pi_{\mathcal{G}}[w]$ will be a regular triangulation, i.e., a subdivision all of whose cells are
simplices. We call such landscapes generic fitness landscapes.

The simplices in the triangulation $\Pi_{\mathcal{G}}[w]$ have a natural interpretation in
terms of populations. For any n-tuple of allele frequencies $v \in \Pi_{\mathcal{G}}$, there is
a unique fittest population p with $\rho(p) = v$. The genotypes that occur in this
fittest population are the vertices of the simplex of $\Pi_{\mathcal{G}}[w]$ which contains v. Thus,
knowing the shape of a fitness landscape w is equivalent to knowing all the fittest
populations for w. For instance, in Example 3.7, if w has positive epistasis, then
01 and 10 cannot coexist in a fittest population, so any fittest population consists
either of genotypes in the triangle {00, 01, 11} or of genotypes in {00, 10, 11}.

The number of shapes of fitness landscapes on a fixed genotype space $\mathcal{G}$ is
finite. If $\mathcal{G}$ has few elements (say, fewer than twenty), then a complete list of all
generic shapes can be compiled using the software TOPCOM (Rambau, 2002) which
enumerates triangulations. For instance, the number of generic shapes of fitness
landscapes on the cube $\{0, 1\}^n$ is two if n = 2, and it is 74 if n = 3 (Table 5.1).
In Section 5, we discuss the 74 shapes of fitness landscapes on $\{0, 1\}^3$.

In order to classify all shapes of fitness landscapes on a genotype space
$\mathcal{G}$, we need to list all polyhedral subdivisions of the genotope $\Pi_{\mathcal{G}}$. This set
of subdivisions is represented by the secondary polytope $\Sigma_{\mathcal{G}}$ (De Loera et al.,
2006, Chap. 5). The secondary polytope $\Sigma_{\mathcal{G}}$ has dimension $c(\mathcal{G})$, it lives in the
space dual to $\mathbb{R}^{\mathcal{G}}$, and its vertices are in bijection with the generic shapes. The
higher-dimensional faces of the secondary polytope $\Sigma_{\mathcal{G}}$ correspond to non-generic
shapes. Thus, the secondary polytope of a genotype space $\mathcal{G}$ represents all shapes
of fitness landscapes on $\mathcal{G}$ and their neighborhood relations. In Example 3.7 the
secondary polytope $\Sigma_{\mathcal{G}}$ is a line segment. Its two vertices correspond to the two
generic shapes, and the segment itself corresponds to the flat shape.

In general, the secondary polytope can be defined as follows. Consider the
average fitness of the fittest populations over all n-tuples of allele frequencies:

$$F(w) = \frac{1}{\mathrm{vol}(\Pi_{\mathcal{G}})} \int_{\Pi_{\mathcal{G}}} \tilde{w}(v) \, dv.$$

This integral is a piecewise linear function in the unknown fitness landscape
$w \in \mathbb{R}^{\mathcal{G}}$. Two landscapes w and w' have the same generic shape precisely when
they lie in a common domain of linearity of this piecewise-linear function. Thus
for each generic shape, the function F is represented by a linear functional on $\mathbb{R}^{\mathcal{G}}$,
i.e., by a vector F in the dual space $(\mathbb{R}^{\mathcal{G}})^*$. The coordinate $F_g$ of this vector with
respect to the standard basis on $(\mathbb{R}^{\mathcal{G}})^*$ equals the probability that the genotype
g appears in a fittest population. The secondary polytope $\Sigma_{\mathcal{G}}$ is the convex hull
in $(\mathbb{R}^{\mathcal{G}})^*$ of these vectors F, one for each generic shape.

Example 4.12. Let $\mathcal{G} = \{000, 011, 100, 101, 110, 111\}$ be the genotype space
in Example 2.2 and Figure 2.2. The two interaction coordinates are

$$x = w_{101} + w_{110} - w_{100} - w_{111}, \qquad y = w_{011} + w_{100} - w_{000} - w_{111}.$$

The average fitness F(w) is a piecewise-linear function in the six fitness values
$w_{000}$, $w_{011}$, $w_{100}$, $w_{101}$, $w_{110}$, and $w_{111}$. The function $4F(w)$ equals

$$\max \left\{ \begin{array}{l}
2w_{000} + 4w_{011} + 4w_{100} + 2w_{101} + 2w_{110} + 2w_{111}, \\
2w_{000} + 4w_{011} + 3w_{100} + 3w_{101} + 3w_{110} + w_{111}, \\
3w_{000} + 3w_{011} + w_{100} + 4w_{101} + 4w_{110} + w_{111}, \\
4w_{000} + 2w_{011} + w_{100} + 3w_{101} + 3w_{110} + 3w_{111}, \\
4w_{000} + 2w_{011} + 2w_{100} + 2w_{101} + 2w_{110} + 4w_{111}
\end{array} \right\}.$$

Using the interaction coordinates, the average fitness can be rewritten as follows:

$$F(w) = \tfrac{1}{4} \cdot \max\{0,\; x,\; 2x - y,\; x - 2y,\; -2y\} + w_{011} + w_{100} + (w_{000} + w_{101} + w_{110} + w_{111})/2.$$

The five cases in this maximum correspond to the five possible shapes of the
fitness landscape. The corresponding triangulations of $\Pi_{\mathcal{G}}$ are

{{101, 111, 100, 011}, {110, 111, 100, 011}, {101, 000, 100, 011}, {110, 000, 100, 011}}
{{101, 000, 100, 011}, {110, 000, 100, 011}, {110, 101, 100, 011}, {110, 101, 111, 011}}
{{110, 101, 111, 011}, {110, 101, 000, 100}, {110, 101, 000, 011}} (the last simplex has volume 2)
{{110, 101, 000, 100}, {110, 101, 000, 111}, {101, 000, 111, 011}, {110, 000, 111, 011}}
{{101, 000, 111, 100}, {110, 000, 111, 100}, {101, 000, 111, 011}, {110, 000, 111, 011}}.

The secondary polytope $\Sigma_{\mathcal{G}}$ is a pentagon whose vertices are labeled by the five
triangulations. It is represented geometrically as the pentagon in the interaction
space $\mathcal{I}_{\mathcal{G}}$ whose directed edges are x, x − y, −x − y, −x, and 2y. See Figure 6.6
for an illustration of the pentagon $\Sigma_{\mathcal{G}}$ in the context of HIV fitness data. □
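
The coefficients of the five linear forms in Example 4.12 can be recomputed from the triangulations: the coefficient of $w_g$ is the sum of the normalized volumes of the tetrahedra that contain g. The following sketch assumes 0/1 strings for genotypes and uses hypothetical helper names (`normalized_volume`, `gkz_vector`); the normalized volume of a tetrahedron is taken to be the absolute value of the 3×3 determinant of its edge vectors.

```python
def normalized_volume(simplex):
    """Normalized volume |det| of a tetrahedron given by four 0/1 strings."""
    a, b, c, d = ([int(x) for x in g] for g in simplex)
    u, v, w = ([q - p for p, q in zip(a, pt)] for pt in (b, c, d))
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det)

def gkz_vector(triangulation, genotypes):
    """For each genotype, sum the normalized volumes of the simplices containing it."""
    totals = dict.fromkeys(genotypes, 0)
    for simplex in triangulation:
        volume = normalized_volume(simplex)
        for g in simplex:
            totals[g] += volume
    return tuple(totals[g] for g in genotypes)

genotypes = ["000", "011", "100", "101", "110", "111"]
triangulations = [
    [("101", "111", "100", "011"), ("110", "111", "100", "011"),
     ("101", "000", "100", "011"), ("110", "000", "100", "011")],
    [("101", "000", "100", "011"), ("110", "000", "100", "011"),
     ("110", "101", "100", "011"), ("110", "101", "111", "011")],
    [("110", "101", "111", "011"), ("110", "101", "000", "100"),
     ("110", "101", "000", "011")],
    [("110", "101", "000", "100"), ("110", "101", "000", "111"),
     ("101", "000", "111", "011"), ("110", "000", "111", "011")],
    [("101", "000", "111", "100"), ("110", "000", "111", "100"),
     ("101", "000", "111", "011"), ("110", "000", "111", "011")],
]
vectors = [gkz_vector(t, genotypes) for t in triangulations]
for vec in vectors:
    print(vec)
```

Differences of consecutive vectors in this list give the directed edges of the pentagon, written in the coordinates $(w_{000}, w_{011}, w_{100}, w_{101}, w_{110}, w_{111})$.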

We now explain the relationship between the shapes of fitness landscapes on
$\mathcal{G}$ and the circuits in the interaction space $\mathcal{I}_{\mathcal{G}}$. We define the circuit sign pattern
of a fitness landscape $w \in \mathbb{R}^{\mathcal{G}}$ to be the list which records the sign (positive, zero
or negative) of the numerical value of each circuit at w.

Proposition 4.13. The shape $\Pi_{\mathcal{G}}[w]$ of a fitness landscape $w \in \mathbb{R}^{\mathcal{G}}$ is determined
by its circuit sign pattern, but the converse generally does not hold.

Proof. Both the circuit sign patterns and the shapes define a subdivision of $\mathbb{R}^{\mathcal{G}}$
into cones which fit together to form a fan. This ensures that we need only
consider the generic case when all circuits are either positive or negative at w.
Each such linear inequality can be written in the form

$$\sum_{g \in \mathcal{G}_1} \alpha_g w_g \;>\; \sum_{g \in \mathcal{G}_2} \beta_g w_g,$$

where $\mathcal{G}_1$ and $\mathcal{G}_2$ are disjoint subsets of $\mathcal{G}$, and the $\alpha_g$ and $\beta_g$ are positive reals.
The genotypes in $\mathcal{G}_2$ cannot coexist in a fittest population, since they can be
replaced by the genotypes in $\mathcal{G}_1$, thus increasing population fitness while keeping
the allele frequencies unchanged. Containing such a replaceable set $\mathcal{G}_2$ is the only
obstruction to being a simplex in the triangulation $\Pi_{\mathcal{G}}[w]$. In other words, the sets
$\mathcal{G}_2$ derived from circuits as above are the minimal non-faces of the triangulation
$\Pi_{\mathcal{G}}[w]$. This proves that $\Pi_{\mathcal{G}}[w]$ is determined by the circuit sign pattern at w.

The second part of the proposition follows from Figure 6.6. There are precisely
four circuits, namely, x, y, x + y, x − y, and hence eight possible {+/−}-sign
patterns. The eight sign patterns correspond to only five distinct shapes. □

5. Three-way Interactions 

We illustrate our generalization of the classical two-locus two-allele situation
(Figure 1.1) by examining the possible fitness shapes for three biallelic loci. Here,
the genotype space is $\mathcal{G} = \{000, 001, 010, 011, 100, 101, 110, 111\}$, and the possible
fitness shapes are exactly the 74 triangulations of the 3-cube (Table 5.1).

Each triangulation is uniquely represented by its GKZ vector (De Loera et
al., 2006), which is the vector $F \in (\mathbb{R}^{\mathcal{G}})^*$ introduced prior to Example 4.12. The
GKZ vector indicates for each vertex $g \in \mathcal{G}$ the sum of the normalized volumes
of all tetrahedra in the given triangulation that contain g. Equivalently, if the
allele frequency vector is chosen uniformly at random from $\Pi_{\mathcal{G}}$, then the g-th
entry of the GKZ vector is the probability that genotype g appears in the fittest
population. We refer to the shape of a fitness landscape for a three-locus two-allele
system by its number (1 to 74) as appearing in the first column of Table 5.1.

The 74 shapes label the vertices of the secondary polytope of the 3-cube
and are listed in Table 5.1. As an example consider the 17th vertex of this polytope:

    17/3    23346114    3e28g51b54d

This says that shape 17 has the GKZ vector (2, 3, 3, 4, 6, 1, 1, 4). The third column
says that shape 17 is adjacent to the shapes 3, 28, 51 and 54. The letters a, b, ...,
t refer to the list of circuits in Example 3.9. The GKZ vector (2, 3, 3, 4, 6, 1, 1, 4)
differs from the GKZ vector of shape 3 by the circuit

$$e = u_{011} + u_{111} = w_{000} - w_{001} - w_{010} + w_{011} = (1,-1,-1,1,0,0,0,0).$$

Similarly, shape 17 differs from shape 28 by the circuit $-g = -u_{110} - u_{101}$,
from shape 51 by the circuit $b = u_{110} - u_{111}$, and from shape 54 by the circuit
$d = u_{101} - u_{111}$. The circuits $e$, $b$, and $d$ measure conditional epistasis between
two loci when the third is fixed, so the secondary polytope tells us, for example,
that shapes 17 and 51 differ only by such a conditional epistasis interaction.

#/T    GKZ       Out-edges              #/T    GKZ       Out-edges
1/1    15515115  3t4q5o6m               38/4   31355313  39l44g51c59d
2/1    51151551  7s8r9p10n              39/4   31533513  38l44i53e60f
3/2    14436114  1t11b13d17e            40/4   33155133  42j45g54a61b
4/2    14614314  1q12b14f18c            41/4   33511533  43h46i55a62b
5/2    16414134  1o15d16f19a            42/4   35133153  40j45k57e63f
6/2    34414116  1m28e29c31a            43/4   35311353  41h46k58c64d
7/2    41163441  2s20a22c26f            44/4   51333315  38g39i65b68a
8/2    41341641  2r21a23e27d            45/4   53133135  40g42k66d69c
9/2    43141461  2p24c25e30b            46/4   53311335  41i43k67f70e
10/2   61141443  2n32f33d34b            47/5   13356222  11d13b35f71e
11/3   13446213  3b12l47d51e            48/5   13623522  12f14b36d72c
12/3   13624413  4b11l48f53c            49/5   16323252  15f16d37b73a
13/3   14346123  3d15j47b54e            50/5   22265331  20c22a35e71f
14/3   14613423  4f16h48b55c            51/5   22356213  11e17b38c71d
15/3   16324143  5d13j49f57a            52/5   22532631  21e23a36c72d
16/3   16413243  5f14h49d58a            53/5   22623513  12c18b39e72f
17/3   23346114  3e28g51b54d            54/5   23256123  13e17d40a71b
18/3   23613414  4c29i53b55f            55/5   23612523  14c18f41a72b
19/3   26313144  5a31k57d58f            56/5   25232361  24e25c37a73b
20/3   31264431  7a21l50c59f            57/5   26223153  15a19d42e73f
21/3   31442631  8a20l52e60d            58/5   26312253  16a19f43c73d
22/3   32164341  7c24j50a61f            59/5   31265322  20f26a38d71c
23/3   32431641  8e25h52a62d            60/5   31532622  21d27a39f72e
24/3   34142361  9c22j56e63b            61/5   32165232  22f26c40b71a
25/3   34231461  9e23h56c64b            62/5   32521632  23d27e41b72a
26/3   41164332  7f32g59a61c            63/5   35132262  24b30c42f73e
27/3   41431632  8d33i60a62e            64/5   35221362  25b30e43d73c
28/3   43324116  6e17g65c66a            65/5   52323216  28c29e44b74a
29/3   43413216  6c18i65e67a            66/5   53223126  28a31e45d74c
30/3   44131362  9b34k63c64e            67/5   53312226  29a31c46f74e
31/3   44313126  6a19k66e67c            68/5   61232325  32d33f44a74b
32/3   61142334  10f26g68d69b           69/5   62132235  32b34f45c74d
33/3   61231434  10d27i68f70b           70/5   62221335  33b34d46e74f
34/3   62131344  10b30k69f70d           71/6   22266222  47e50f51d54b59c61a
35/4   13355331  36l37j47f50e           72/6   22622622  48c52d53f55b60e62a
36/4   13533531  35l37h48d52c           73/6   26222262  49a56b57f58d63e64c
37/4   15333351  35j36h49b56a           74/6   62222226  65a66c67e68b69d70f

Table 5.1: The 74 shapes of fitness landscapes of the three-locus two-allele system. The
first column specifies the shape number and type, the second column the GKZ vector,
and the third column the out-edges in the secondary polytope.
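
The adjacency information for shape 17 can be checked by subtracting GKZ vectors. The following is a minimal sketch, not part of the original analysis: the GKZ vectors are those listed for shapes 3, 17, 28, 51 and 54, the coefficient vector for $e$ is the one given in the text, and the vectors for $b$ and $d$ are written out here by analogy (conditional epistasis of two loci with the third locus fixed at 1), which is an assumption about the normalization used in Example 3.9.

```python
# GKZ vectors of selected shapes, one digit per genotype in the order
# 000, 001, 010, 011, 100, 101, 110, 111 (taken from the table of 74 shapes).
GKZ = {
    3: "14436114",
    17: "23346114",
    28: "43324116",
    51: "22356213",
    54: "23256123",
}

# Circuits as coefficient vectors on (w_000, ..., w_111).
# e is stated in the text; b and d are assumed to be the analogous
# conditional epistasis expressions with the third (b) or second (d) locus at 1.
CIRCUITS = {
    "e": (1, -1, -1, 1, 0, 0, 0, 0),
    "b": (0, 1, 0, -1, 0, -1, 0, 1),
    "d": (0, 0, 1, -1, 0, 0, -1, 1),
}


def gkz(shape):
    """Return the GKZ vector of a shape as a tuple of integers."""
    return tuple(int(digit) for digit in GKZ[shape])


def difference(shape, neighbor):
    """Coordinate-wise difference GKZ(shape) - GKZ(neighbor)."""
    return tuple(x - y for x, y in zip(gkz(shape), gkz(neighbor)))


for neighbor, name in [(3, "e"), (51, "b"), (54, "d")]:
    print(neighbor, name, difference(17, neighbor) == CIRCUITS[name])

# The difference to shape 28 is a negative multiple of w_000 - w_011 - w_100 + w_111.
print(28, difference(17, 28))
```

Each of the three conditional circuits changes exactly the four GKZ entries of one face of the cube, which is why the corresponding edges of the secondary polytope involve only two loci.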

In the first column of Table 5.1, one more number of interest is displayed
after the shape number and separated by a slash, namely the interaction type. 
Shapes of the same interaction type differ only in the labeling of the vertices 
of the cube; in other words, they correspond to the same combinatorial type of 
triangulation. The six interaction types correspond to the six symmetry classes 
of triangulations of the 3-cube. These six classes are depicted in (De Loera et al., 
2006, Fig. 1.38) and in (Grier et al., 2006, Fig. 1), and we refer to these references 
for further mathematical background and generalizations to higher dimensions. 

Table 5.1 specifies all the shapes of fitness landscapes for three biallelic loci.
It is useful to examine the six interaction types, and to consider their biological
meaning. The two triangulations of type 1 are obtained if the three-way interaction
$u_{111}$ is either very large or very small. Each consists of a central tetrahedron of
volume two and four adjacent tetrahedra of volume one. The central tetrahedron 
can either use the genotypes with an even number of mutations, or those with 
an odd mutation count. Fitness landscapes of this type are linear on the central 
tetrahedron and on the adjacent tetrahedra. The curvature is such that fitness 
decreases more strongly than expected towards any of the genotypes that are not 
part of the central tetrahedron. These four genotypes are "sliced off", and the 
fittest populations avoid these genotypes whenever possible. Thus, the shape be- 
ing of type 1 means that the fittest populations include either all genotypes with 
an even number of mutations, or all genotypes with an odd number of mutations. 

Type 6 landscapes can be regarded as opposite to type 1. The four trian- 
gulations of this type are indexed by the four different diagonals of the 3-cube. 
Each diagonal induces a triangulation that consists of six tetrahedra arranged 
around the diagonal in such a way that each tetrahedron has exactly two adja- 
cent tetrahedra. Fitness landscapes of this type are linear on each tetrahedron 
and the curvature is such that any of the six tetrahedra has a higher fitness than 
expected from its two neighbors. No single genotypes are sliced off, as all entries 
of the GKZ vector are bigger than 1. For example, shape 74 uses the diagonal 
through 000 and 111. Hence the vertices of the tetrahedra correspond to different 
accumulative mutational pathways of length four from 000 to 111. For example,
$000 \to 001 \to 101 \to 111$ is one such pathway. Thus, the fittest populations
involve all genotypes from exactly one of the six possible pathways.
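
The two extreme types can be made concrete by building the triangulations explicitly and summing normalized volumes. The sketch below is illustrative only: it constructs the type 6 triangulation along the diagonal from 000 to 111 (one unit-volume tetrahedron per mutational pathway) and a type 1 triangulation whose central tetrahedron uses the genotypes with an odd number of mutations. The helper names are hypothetical.

```python
from itertools import permutations

GENOTYPES = ["000", "001", "010", "011", "100", "101", "110", "111"]


def flip(genotype, locus):
    """Return the genotype with the allele at the given locus switched."""
    allele = "1" if genotype[locus] == "0" else "0"
    return genotype[:locus] + allele + genotype[locus + 1:]


def gkz_vector(simplices):
    """Sum, for each genotype, the normalized volumes of the simplices containing it."""
    totals = dict.fromkeys(GENOTYPES, 0)
    for vertices, volume in simplices:
        for g in vertices:
            totals[g] += volume
    return tuple(totals[g] for g in GENOTYPES)


def pathway_triangulation():
    """Type 6: one tetrahedron of volume 1 per pathway from 000 to 111."""
    simplices = []
    for order in permutations(range(3)):
        path = ["000"]
        for locus in order:
            path.append(flip(path[-1], locus))
        simplices.append((path, 1))
    return simplices


def central_triangulation(central):
    """Type 1: a central tetrahedron of volume 2 and four corner tetrahedra of volume 1."""
    simplices = [(list(central), 2)]
    for g in GENOTYPES:
        if g not in central:
            # The corner tetrahedron consists of g and its three neighbors.
            simplices.append(([g] + [flip(g, locus) for locus in range(3)], 1))
    return simplices


odd = ["001", "010", "100", "111"]
print(gkz_vector(pathway_triangulation()))     # GKZ vector listed for shape 74
print(gkz_vector(central_triangulation(odd)))  # GKZ vector listed for shape 1
```

The entries equal to 1 in the second vector are exactly the four genotypes that are sliced off, while no entry of the first vector is that small.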

Types 1 and 6 represent the two extreme types of three-way interaction. 
Type 1 emphasizes the number of mutations irrespective of the particular con- 
text or pathway they occur in. By contrast, type 6 emphasizes the mutational 
pathway. Type 1 fitness shapes occur when intermediate types are fitter than 
expected irrespective of the specific intermediate genotype. Likewise, we expect 
type 6 fitness shapes whenever higher fitness values than expected occur only in 
specific genotypic contexts. The remaining interaction types (2 to 5) are inter- 
mediate and share features from both type 1 and type 6. 

It is important to note that the third column in Table 5.1 gives the minimal
set of linear inequalities that characterize the robustness cone for each of the 74
shapes. These cones consist of all fitness landscapes that have the given shape.
For instance, shape 74 is characterized by the inequalities $a, b, c, d, e, f > 0$, which
says that conditional epistasis is positive for any fixation of one of the three loci.
While type 6 requires six inequalities, each of the other five types requires only
four. For instance, being in shape 38 means that the conditional epistases $c$ and
$d$ are positive while the marginal epistases $g$ and $l$ are negative.

For the four-locus two-allele system ($\mathcal{G} = \{0, 1\}^4$), it was shown by Grier et al.
(2006) that the number of shapes is 87,959,448, and that these fall into 235,277
symmetry types. A table analogous to Table 5.1, listing one representative shape
from each type, is available at http://bio.math.berkeley.edu/4cube/.



6. Positive Epistasis in HIV? 

In this section, we characterize the fitness landscape of HIV-1. We use the 
data published in (Segal et al., 2004), which is also described in (Bonhoeffer et al., 
2004). The data consists of 288 genotype-fitness pairs. The reported genotype 
is the HIV protein sequence composed of positions 4 to 99 of the protease (PRO) 
and positions 38 to 223 of the reverse transcriptase (RT). Fitness was measured 
as the number of offspring in a single replication cycle and was reported rela- 
tive to the fixed wild type strain NL4-3 on a logarithmic scale. Univariate and 
multivariate analyses show that mutation L90M in the protease, and mutations 
M184V and T215Y in the RT are the major determinants of fitness in this data 
set. The notation L90M means that at position 90, the amino acid leucine (L) is
replaced by methionine (M). We therefore analyze the fitness shape of this
three-locus two-allele system. We identify the subsets of {L90M, M184V, T215Y} with
binary strings of length three. For instance, the string 010 in the third row of
Table 6.2 is the genotype carrying only mutation M184V.

Genotype  Count  Min.     1st Qu.  Median  Mean    3rd Qu.  Max.
000       214     0.1917  1.4770   1.6410  1.5800  1.7910   2.0530
001         5     0.5344  0.6990   1.1880  1.1950  1.7710   1.7850
010         8    -0.3355  1.1440   1.2960  1.1330  1.4870   1.5310
011         8     1.0000  1.2500   1.5150  1.4300  1.6360   1.7240
100         7     0.4771  1.3470   1.4650  1.4410  1.7890   1.8750
101        13     0.3010  0.8673   1.3420  1.2320  1.5840   1.8870
110        11     0.6021  1.1610   1.3700  1.2940  1.5370   1.6920
111        22    -0.4771  0.9472   1.1790  1.0450  1.3850   1.7900

Table 6.2: HIV random fitness landscape on the three amino acid mutations PRO L90M,
RT M184V, and RT T215Y. The mean values (Mean column) are used for significance testing.

The fitness values were measured for different viruses, some of which share
a mutational pattern on the three selected loci. Thus, instead of a single fitness
value for each genotype, we have a distribution. A random fitness landscape on a
genotype space $\mathcal{G}$ is a collection of $|\mathcal{G}|$ continuous random variables $W = (W_g)_{g \in \mathcal{G}}$.
We think of a realization of $W$ as a real-valued function $w: \mathcal{G} \to \mathbb{R}$ on the
genotype space. Thus, $W$ takes values in $\mathbb{R}^{\mathcal{G}}$. In general, stochastic fluctuations
in $W$ can arise from measurement noise in assessing the reproductive fitness
experimentally and from biological variation that is not linked to the $n$ genetic
loci. The HIV random fitness landscape is summarized in Table 6.2.

Following Bonhoeffer et al. (2004) and Sanjuan et al. (2004), we first examine
the total marginal two-way epistasis

$$E = u_{110} + u_{101} + u_{011},$$

with $u_{ijk}$ defined in Example 3.9. Computing $E$ means pooling epistasis estimates
obtained from the three pairs of loci. We use randomized fitness values to
estimate $E$ under the null hypothesis of no epistasis. Unlike in (Bonhoeffer et al.,
2004), we do not find the observed positive value $E = 0.025$ to be significantly
greater than zero ($P > 0.35$). This discrepancy may reflect the limited statistical
power of our analysis, which is based on a smaller data set and on only 3 loci,
although Bonhoeffer et al. found positive epistasis to be more pronounced when
restricting to the most influential sequence positions.

Circuit  Pair     Context  Cond. epist.  P-value
a        90-184   T215      0.300        0.110
b        90-184   215Y     -0.421        0.059
c        90-215   M184      0.175        0.230
d        90-215   184V     -0.545        0.013
e        184-215  L90       0.682        0.008
f        184-215  90M      -0.039        0.410

Table 6.3: Conditional 2-way epistasis in HIV. The P-value denotes the fraction of
epistasis values in the bootstrap sample that are lower (in the case of negative epistasis)
or higher (in the case of positive epistasis) than the observed mean conditional epistasis.
Circuits are labeled as in Example 3.9.

Rather than marginalizing, we next consider two-way epistasis conditional
on the value at the third locus, i.e., we determine the pattern of epistasis for each
pair of loci in the context of the third locus (Table 6.3). For all three pairs, we
find positive epistasis conditioned on the wild type allele at the third position,
and negative conditional epistasis otherwise. Two out of six comparisons reached
statistical significance, including one case of negative epistasis. Figure 6.4 shows
the empirical null distributions (grey histogram bars) and, in addition, the
empirical distributions of the circuits a, b, c, d, e, and f (as defined in Example 3.9)
that represent conditional epistasis (clear histogram bars). The distributions of
circuits are obtained by resampling fitness values for the same genotype. They
reflect the uncertainty in the shape of the fitness landscape due to multiple
discordant measurements. We conclude that the HIV fitness landscape on the three
selected loci is insufficiently described as "positively epistatic".
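
The observed conditional epistasis values can be recomputed from the per-genotype mean fitness values. The sketch below is an illustration under stated assumptions: it takes the eight means from the HIV fitness table, orders the loci as (L90M, M184V, T215Y), and evaluates each circuit as $w_{00} + w_{11} - w_{01} - w_{10}$ on the pair with the third locus fixed. It does not reproduce the bootstrap P-values, which require the individual measurements; the dictionary layout and function name are hypothetical.

```python
# Mean fitness per genotype; loci are ordered (L90M, M184V, T215Y).
MEAN_FITNESS = {
    "000": 1.5800, "001": 1.1950, "010": 1.1330, "011": 1.4300,
    "100": 1.4410, "101": 1.2320, "110": 1.2940, "111": 1.0450,
}

# circuit -> (positions of the interacting pair, position of the fixed locus, fixed allele)
CIRCUITS = {
    "a": ((0, 1), 2, "0"),  # 90-184 in context T215
    "b": ((0, 1), 2, "1"),  # 90-184 in context 215Y
    "c": ((0, 2), 1, "0"),  # 90-215 in context M184
    "d": ((0, 2), 1, "1"),  # 90-215 in context 184V
    "e": ((1, 2), 0, "0"),  # 184-215 in context L90
    "f": ((1, 2), 0, "1"),  # 184-215 in context 90M
}


def conditional_epistasis(w, pair, fixed_locus, allele):
    """Two-way epistasis of `pair` with the remaining locus fixed to `allele`."""
    def genotype(x, y):
        bits = [""] * 3
        bits[pair[0]], bits[pair[1]] = x, y
        bits[fixed_locus] = allele
        return "".join(bits)

    return (w[genotype("0", "0")] + w[genotype("1", "1")]
            - w[genotype("0", "1")] - w[genotype("1", "0")])


epistasis = {name: conditional_epistasis(MEAN_FITNESS, *spec)
             for name, spec in CIRCUITS.items()}
for name, value in epistasis.items():
    print(name, round(value, 3))
```

Because the tabulated means are rounded to three or four digits, the recomputed values can differ from the tabulated conditional epistasis in the last digit.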

We now characterize the shape of the HIV fitness landscape by estimating
the distribution on the 74 shapes in Table 5.1 induced by the data in Table 6.2.
We use a resampling scheme to estimate the distribution and shape of $W$. The
distribution of $\Pi_{\mathcal{G}}[W]$ is shown in Figure 6.5(a). The dominant shapes are #7
(frequency 20.9%), #2 (20.7%), #26 (15%), and #61 (11.3%). In Figure 6.5(b),
we randomized the fitness values. Comparing the two histograms shows that the
observed shape distribution is very different from that of a landscape in which
the same values are randomly assigned to genotypes. The histogram in Figure 6.5(c)
results from sampling fitness values uniformly at random from [0, 1] and is similar
to the one in subfigure (b). Figure 6.5 can be used to detect differences in fitness
shape between random landscapes, and to identify single shapes that appear with
high probability.

[Figure 6.4: histogram panels (a) to (f), one per circuit; horizontal axis: Epistasis.]

Figure 6.4: Conditional two-way epistasis in HIV, grouped according to circuits a, ..., f.

The dominant shapes of the HIV fitness landscape are very similar and have
certain features in common. For example, the GKZ vectors that correspond to
shapes 2, 7, 10, 26, and 32 all share a coordinate 1 in both the second and third
position. These five shapes account for 61.5% of the probability mass. They all
slice off the two genotypes 001 and 010, which correspond to the single mutants
{T215Y} and {M184V}, respectively. Both mutations are known to reduce the
fitness of HIV. Indeed, M184V develops shortly after initiation of therapy with
the antiviral drug lamivudine in most patients. Although M184V carrying viruses
are resistant to lamivudine, administration of the drug is often continued in order
to maintain the mutation and its fitness impairing effect (Wainberg, 2004).
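
The five shapes sharing this feature can be picked out mechanically from the GKZ column of the table of 74 shapes. The following sketch assumes the GKZ vectors as listed there (shape numbers run from 1 to 74 in reading order) and treats a genotype as sliced off when its GKZ entry equals 1; the variable and function names are illustrative.

```python
# GKZ vectors of shapes 1..74, one digit per genotype in the order
# 000, 001, 010, 011, 100, 101, 110, 111.
GKZ_TABLE = """
15515115 51151551 14436114 14614314 16414134 34414116 41163441 41341641 43141461 61141443
13446213 13624413 14346123 14613423 16324143 16413243 23346114 23613414 26313144 31264431
31442631 32164341 32431641 34142361 34231461 41164332 41431632 43324116 43413216 44131362
44313126 61142334 61231434 62131344 13355331 13533531 15333351 31355313 31533513 33155133
33511533 35133153 35311353 51333315 53133135 53311335 13356222 13623522 16323252 22265331
22356213 22532631 22623513 23256123 23612523 25232361 26223153 26312253 31265322 31532622
32165232 32521632 35132262 35221362 52323216 53223126 53312226 61232325 62132235 62221335
22266222 22622622 26222262 62222226
""".split()

GENOTYPES = ["000", "001", "010", "011", "100", "101", "110", "111"]


def sliced_off(shape):
    """Genotypes whose GKZ entry is 1 in the given shape (numbered from 1)."""
    digits = GKZ_TABLE[shape - 1]
    return {g for g, digit in zip(GENOTYPES, digits) if digit == "1"}


# Shapes that slice off both single mutants 001 and 010.
slicing_off_both = [shape for shape in range(1, len(GKZ_TABLE) + 1)
                    if {"001", "010"} <= sliced_off(shape)]
print(slicing_off_both)

# Sanity check: every GKZ vector sums to 4 * 6 = 24, since each tetrahedron
# contributes its normalized volume to four vertices and the cube has volume 6.
print(all(sum(map(int, digits)) == 24 for digits in GKZ_TABLE))
```

The same filter with other genotype pairs gives the faces of the secondary polytope associated with other sliced-off sets.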

This suggests studying the fitness landscape on the subset $\mathcal{G} = \{000, 110,
011, 100, 101, 111\}$. The secondary polytope of its genotope is a pentagon whose vertices
correspond to the shapes 2, 7, 10, 26, and 32, i.e., the five triangulations of the
genotope in Example 2.2 and Figure 2.2. Figure 6.6 shows this pentagon. Vertices
and edges are labeled as in Table 5.1. The pentagon reveals that shapes 26 and
7, and 32 and 10, differ only by the circuit $f$. This circuit represents conditional
epistasis between the two RT loci 184 and 215 in the context of the protease
mutation 90M (Table 6.3, Figure 6.4(f)). The other edges are the circuits $g$, $n$,
and $s$. The circuit $g$ relates the marginal epistasis of the two pairs (PRO 90, RT
184) and (PRO 90, RT 215). The circuit $n$ compares the total two-way epistasis
$E$ to the three-way interaction $u_{111}$, and $s$ compares the epistasis in (RT 184,
RT 215) to the remaining pairs plus the three-way interaction.

[Figure 6.5: histogram panels (a) HIV data, (b) randomized HIV data, (c) randomly generated data; horizontal axis: Shape.]

Figure 6.5: Three-way epistasis in HIV, analyzed using the 74 shapes in Table 5.1.
Shape distribution of (a) the HIV random fitness landscape on mutations PRO L90M, RT
M184V, and RT T215Y; (b) the same fitness values randomly assigned to genotypes; and
(c) a random landscape in which fitness values are identically and uniformly distributed.

Figure 6.6: The secondary polytope of the genotope in Figure 2.2. It appears as a face
of the secondary polytope of the 3-cube and its vertices and edges are labeled as in
Table 5.1. The underlying genotype space is defined in Example 2.2 and represents the
HIV fitness landscape shapes that slice off the single mutants {M184V} and {T215Y}.

In summary, analysis of the HIV fitness data has revealed a specific pentagon
in the boundary of the (four-dimensional) secondary polytope of the 3-cube. The
three dominant shapes 7, 2, and 26 in the random fitness landscape are adjacent
on the pentagon, and they correspond to three of the five triangulations of the
genotope in Figure 2.2. This geometric characterization of the fitness landscape
captures more adequately the interactions underlying the HIV fitness data.

7. Synergistic Epistasis in Drosophila 

Turning to a higher-dimensional genetic system, this section illustrates our 
concepts and methods with the fitness data for Drosophila melanogaster reported 
by Whitlock and Bourguet (2000). Those authors considered five genetic loci, 
denoted px/sp, b, ca, e/sr and h, for which they created a set of $32 = 2^5$
homozygous lines fixed for all the possible different combinations of mutations. We
characterize the shape of the fitness landscape they measured, and we discuss
their statistical analysis in light of our findings. Here, $l = 2$ and the
genotype space is $\mathcal{G} = \{0,1\}^5$. The binary strings $g \in \mathcal{G}$ represent the subsets of
{px/sp, b, ca, e/sr, h}. For instance, 01011 represents the mutant b/e/sr/h. In
what follows, we list the elements of $\mathcal{G}$ in the following order:



00000  10000  01000  00100  00010  00001  11000  10100
10010  10001  01100  01010  01001  00110  00101  00011
11100  11010  11001  10110  10101  10011  01110  01101
01011  00111  11110  11101  11011  10111  01111  11111

The first column in (Whitlock and Bourguet, 2000, Tab. 1) consists of the fitness
values $w_{ijklm}$, which measure the relative reproductive fitness of each strain:

-0.232  -0.850  -0.312  -0.214  -0.847   0.507   -0.238  -0.490
-1.030   0.232  -0.968  -1.338  -0.034  -1.47    -0.739   0.2176
-0.712  -1.820  -0.529  -0.786  -0.195  -0.641   -1.945  -0.047
 0.0264 -1.296  -2.446  -1.973  -1.180  -1.024   -1.856  -4.560

These values define the fitness landscape $w: \mathcal{G} \to \mathbb{R}$, e.g., $w_{01100} = -0.968$.

Whitlock and Bourguet's analysis of epistasis begins with an examination of 
the ten two-way interactions. They find that six of the 10 pairs show significant 
interaction. In all of them the direction of the effect is negative (synergistic 
epistasis). As we show below, each of the two-way interactions depends on the 
genotypes at the remaining loci, and a refined analysis provides more information. 

The shape $\Pi_{\mathcal{G}}[w]$ of the Drosophila fitness landscape is a triangulation of
the 5-cube $\Pi_{\mathcal{G}}$. It consists of the 110 maximal simplices displayed in Figure 7.7.
Although the figure does not convey an intuitive image of the shape, it
precisely characterizes the interactions. For verification, we computed the
triangulation twice, once using the computer algebra software Macaulay 2 (Grayson
and Stillman, 1999) and once using the geometry software Polymake (Gawrilow
and Joswig, 2001), both in less than one second running time on a standard PC.

Returning to the question of synergistic epistasis, we examine each pair of
loci when we fix the values (0 or 1) at the remaining three loci. For example, the
first pair of loci (1, 2) (i.e., (px/sp, b)) has positive epistasis twice, namely when
there is either no other mutation or only the mutation 3 (ca). However, as soon
as either of mutations 4 (e/sr) or 5 (h) occurs, the epistasis between px/sp and
b becomes negative. Algebraically, if we write

$$a_{**klm} = w_{00klm} + w_{11klm} - w_{01klm} - w_{10klm},$$

then $a_{**000} = 0.692$ and $a_{**100} = 0.532$ are positive while the other six epistatic
interactions have negative numerical values:

$a_{**001} = -0.220$,  $a_{**010} = -0.299$,  $a_{**011} = -0.3478$,
$a_{**101} = -2.470$,  $a_{**110} = -1.185$,  $a_{**111} = -2.976$.
Biologically, this analysis confirms the negative total marginal two-way epistasis
described in (Whitlock and Bourguet, 2000); however, it also reveals precise
information about the conditional epistasis. Geometrically, an analysis of all two-way
interactions involves examining the 80 two-dimensional faces of the 5-cube. Of
these 80 squares, we find that precisely 26 have positive epistasis. We list the
numerical values of the 26 positive interactions grouped by pairs of loci:

(1,2)  $a_{**000} = 0.692$,  $a_{**100} = 0.532$
(1,3)  $a_{*0*00} = 0.342$,  $a_{*0*01} = 0.819$,  $a_{*0*10} = 0.867$,  $a_{*0*11} = 1.1306$,  $a_{*1*00} = 0.182$
(1,4)  $a_{*00*0} = 0.435$,  $a_{*01*0} = 0.960$
(1,5)  $a_{*000*} = 0.343$,  $a_{*010*} = 0.820$
(2,3)  $a_{0**01} = 1.233$,  $a_{0**10} = 0.016$
(2,4)  $a_{0*0*1} = 0.3498$,  $a_{0*1*0} = 0.279$,  $a_{1*0*1} = 0.222$
(2,5)  $a_{0*01*} = 0.2998$,  $a_{0*10*} = 1.446$,  $a_{1*01*} = 0.251$
(3,4)  $a_{01**0} = 0.049$,  $a_{10**1} = 0.044$
(3,5)  $a_{01*0*} = 0.643$
(4,5)  $a_{000**} = 0.3256$,  $a_{001**} = 0.699$,  $a_{010**} = 1.0864$,  $a_{110**} = 0.931$

Figure 7.7: The shape of the Drosophila fitness landscape.

Considering that 54/80 = 0.675 is higher than the fraction 0.6 of negative two- 
way interactions identified by Whitlock and Bourguet, and that many of the 
positive epistatic interactions are very small, one can argue a stronger case for 
synergistic epistasis. Importantly, our analysis also reveals exactly which pairs of 
loci can have positive or negative epistasis, and how this depends on the patterns 
at the other loci. The pair (1,3) (i.e., (px/sp, ca)) has the largest number of 
positive epistasis patterns, namely five, while the pair (3,5) (i.e., (ca, h)) has 
only one. In order to assess the statistical significance of these interactions, it is 
imperative to have replicates of the fitness measurements (cf. Section 6). 
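
The tally of positive and negative conditional interactions can be reproduced directly from the 32 fitness values. The sketch below is a minimal illustration, assuming the genotype order and fitness values exactly as listed earlier in this section; loci are indexed from 0 in the code (so the pair (1,3) of the text is `(0, 2)`), and the function and variable names are hypothetical.

```python
from itertools import combinations, product

# Genotypes and fitness values in the order listed in the text.
GENOTYPES = """
00000 10000 01000 00100 00010 00001 11000 10100
10010 10001 01100 01010 01001 00110 00101 00011
11100 11010 11001 10110 10101 10011 01110 01101
01011 00111 11110 11101 11011 10111 01111 11111
""".split()
FITNESS = [
    -0.232, -0.850, -0.312, -0.214, -0.847, 0.507, -0.238, -0.490,
    -1.030, 0.232, -0.968, -1.338, -0.034, -1.47, -0.739, 0.2176,
    -0.712, -1.820, -0.529, -0.786, -0.195, -0.641, -1.945, -0.047,
    0.0264, -1.296, -2.446, -1.973, -1.180, -1.024, -1.856, -4.560,
]
w = dict(zip(GENOTYPES, FITNESS))


def conditional_epistasis(w, pair, context):
    """w_00 + w_11 - w_01 - w_10 on `pair`, with the other three loci set to `context`."""
    i, j = pair
    others = [p for p in range(5) if p not in pair]

    def genotype(x, y):
        bits = [""] * 5
        bits[i], bits[j] = x, y
        for position, allele in zip(others, context):
            bits[position] = allele
        return "".join(bits)

    return (w[genotype("0", "0")] + w[genotype("1", "1")]
            - w[genotype("0", "1")] - w[genotype("1", "0")])


# One value per two-dimensional face of the 5-cube: 10 pairs times 8 contexts.
interactions = {
    (pair, "".join(context)): conditional_epistasis(w, pair, context)
    for pair in combinations(range(5), 2)
    for context in product("01", repeat=3)
}
positive = {key: value for key, value in interactions.items() if value > 0}

print(len(interactions), len(positive))
for pair in combinations(range(5), 2):
    print(pair, sum(1 for (p, _), v in positive.items() if p == pair))
```

Sorting `positive` by value shows how many of the positive interactions are close to zero, which is the basis for the remark that the case for synergistic epistasis may be stronger than the raw fraction suggests.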

The shape of the fitness landscape also reveals information about the fittest
population. To see this, we return to the geometry in Figure 7.7. The
polyhedral complex dual to the triangulation is referred to as the tight span (Grier
et al., 2006). For our data, the tight span is three-dimensional, with f-vector
(110, 214, 127, 22). Besides the 22 three-dimensional cells, the tight span of $\Pi_{\mathcal{G}}[w]$
has ten maximal cells of dimension two and two maximal cells of dimension one.
These two "tentacles" correspond to the genotypes 10000 and 11111, which are
"sliced off" in the triangulation, i.e., each of them lies in only one maximal
simplex. The GKZ vector of the triangulation $\Pi_{\mathcal{G}}[w]$ equals 1/120 times



Recall that the entry of this vector indexed by genotype $g \in \mathcal{G}$ is the probability
that $g$ appears in the fittest population for a randomly chosen allele frequency
vector. For instance, for the strain 10110 = px/sp/ca/e/sr that probability is
83/120 = 69.2%. The triangulation $\Pi_{\mathcal{G}}[w]$ can also be written in a way which
is complementary to the list of 110 maximal simplices. Namely, there are 332
minimal non-faces of $\Pi_{\mathcal{G}}[w]$, 31 non-triangles, and the one non-tetrahedron



The shape of the fitness landscape implies that these four Drosophila mutants
cannot coexist in a maximally fit population; however, any three of them can.

8. Discussion 

Our description of the shape of fitness landscapes in terms of triangulations 
of the genotope directly reveals information about all the interactions among 
the genotypes. In the case of many organisms, including humans, polymorphism 
occurs at single nucleotides in the genome (SNPs), and is usually of two types. 
Thus, even though humans are diploid, and in principle there could be 16 possible 
alleles at a polymorphic site, there are usually only $l = 2$. It is important to note
that even though the number of SNPs is in the millions, the human genotope 
is determined only by the genotypes occurring in the population. The linkage 
disequilibrium structure of the human population (The International HapMap 
Consortium, 2005) suggests that the dimension of this genotope is far smaller 
than would be suggested by the number of SNPs. Thus, there is hope that in the 
future, with measurements of fitness one can learn about populations by
examining the shapes of fitness landscapes on the human genotope. In the meantime, the
mathematics we have developed will be useful for studying interactions among
genotypes in small mutation studies.

Another interpretation of the triangulations of the genotope is in terms of 
the genotypes that can occur in maximally fit populations. Such populations 
must consist of genotypes that label one simplex in the triangulation. In other 
words, the description of the shape of fitness landscapes that we have provided 
is fundamental for understanding the genotypes of populations that evolve by 
recombination but without mutation. In the case of populations that evolve 
with mutation, but without recombination, a complementary analysis of fitness 
landscapes in terms of linear extensions of the genotope (viewed as a poset) is 
provided by Weinreich (2005). An understanding of the relationship between 
these approaches will lead to a deeper understanding of populations that evolve 
by recombination and mutation. 

What our study and Weinreich's have in common is that they fall into the 
domain of non-parametric statistics. In contrast to other papers on fitness land- 
scapes, including (Karlin and Feldman, 1970) and (Kondrashov and Kondrashov, 
2001), we do not make a priori assumptions about the fitness landscape. In par- 
ticular, our analysis of the data in Sections 6 and 7 is not based on any choice 
of model for w. On the other hand, the geometry of the secondary polytope 
interfaces well with Bayesian statistics, because any family of distributions on
$\mathbb{R}^{\mathcal{G}}$ induces a family of distributions on the finite set of possible shapes.